Detailed protocol


To integrate heterogeneous data from different platforms and different studies, a three-step procedure was applied. First, the percentile rank from the above 
"clustering per dataset" procedure was used for the identification of informative genes in the combined datasets. Specifically, the median of percentile ranks 
across datasets was calculated and genes were sorted by this median in ascending order. Then, after excluding genes in the blacklist, the top 1,500 genes were identified 
as informative genes. Within each dataset, for each gene, the normalized expression was corrected for cell cycle effect, donor effect, percentage of 
mitochondrial UMI counts, and DIG signature, and then scaled to z-scores. Second, to reduce technical noise such as transcript dropout, we partitioned single 
cells into small groups (called mini-clusters hereafter), each of which contained similar cells. This strategy is similar to the MetaCell method, but our pipeline is 
compatible with gene expression data measured in CPM/TPM, whereas MetaCell requires count data as input. Specifically, within each dataset, the Seurat v3 
pipeline was applied to the z-score matrix of the informative genes. The parameter k for the k-nearest neighbor algorithm was changed from the default value 
20 to 10, and the resolution for Louvain clustering was set to a high value of 50 (for datasets with < 500 cells, 25 was used instead). The resulting small 
clusters were defined as mini-clusters. The z-score-transformed gene expression was then averaged per mini-cluster, so that the original gene-by-cell 
expression matrix was converted to a gene-by-mini-cluster expression matrix. These matrices from all datasets were combined by column, and only genes present 
in all datasets were kept. The combined matrix was used for downstream analysis. Third, Harmony was applied immediately after PCA, which was based 
on the combined matrix of the informative genes. Then Uniform Manifold Approximation and Projection (UMAP) and clustering (both implemented in the Seurat 
v3 pipeline) were performed on the "harmony space" to identify clusters of mini-clusters (called meta-clusters hereafter). 
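
The data-handling parts of the first two steps (selecting informative genes by median percentile rank, averaging z-scores per mini-cluster, and combining datasets by column) can be illustrated with a minimal plain-Python sketch. This is not the authors' implementation, which relies on the Seurat v3 pipeline and Harmony; the function names, the dictionary-based data layout, and the "dataset:mini-cluster" column labels are hypothetical and chosen only for illustration. The sketch also assumes that the median is taken over the datasets in which a gene was ranked, which the text does not specify. Covariate correction, mini-cluster assignment, PCA, Harmony, and UMAP are not shown.

```python
from statistics import mean, median


def select_informative_genes(ranks_by_dataset, blacklist, n_top=1500):
    """Order genes by median percentile rank across datasets (ascending).

    ranks_by_dataset: {dataset: {gene: percentile_rank}}
    """
    per_gene = {}
    for ranks in ranks_by_dataset.values():
        for gene, rank in ranks.items():
            per_gene.setdefault(gene, []).append(rank)
    # Assumption: median over the datasets in which the gene was ranked.
    medians = {g: median(r) for g, r in per_gene.items() if g not in blacklist}
    # Ties are broken by gene name so the ordering is deterministic.
    ordered = sorted(medians, key=lambda g: (medians[g], g))
    return ordered[:n_top]


def average_per_mini_cluster(zscores, labels):
    """Convert a gene-by-cell z-score matrix to gene-by-mini-cluster.

    zscores: {gene: [z-score per cell]}
    labels:  [mini-cluster label per cell], same order as the cells
    """
    clusters = sorted(set(labels))
    return {
        gene: {
            c: mean(v for v, lab in zip(values, labels) if lab == c)
            for c in clusters
        }
        for gene, values in zscores.items()
    }


def combine_by_column(matrices):
    """Join per-dataset matrices column-wise, keeping genes shared by all.

    matrices: {dataset: {gene: {mini_cluster: mean_z_score}}}
    """
    shared = set.intersection(*(set(m) for m in matrices.values()))
    combined = {gene: {} for gene in sorted(shared)}
    for name, matrix in matrices.items():
        for gene in combined:
            for cluster, value in matrix[gene].items():
                # Hypothetical column label: "<dataset>:<mini-cluster>".
                combined[gene][f"{name}:{cluster}"] = value
    return combined
```

In this sketch, the output of `combine_by_column`, restricted to the genes returned by `select_informative_genes`, plays the role of the combined matrix on which PCA and Harmony are subsequently run.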

The code implementing the pipeline is available on GitHub (https://github.com/Japrin/scPip). 